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Introduction 


The wCorr package can be used to calculate Pearson, Spearman, polyserial, and polychoric 
correlations, in weighted or unweighted form.' The package implements the tetrachoric 
correlation as a specific case of the polychoric correlation and biserial correlation as a specific 
case of the polyserial correlation. When weights are used, the correlation coefficients are 
calculated with so called sample weights or inverse probability weights.” 


This vignette introduces the methodology used in the wCorr package for computing the Pearson, 
Spearman, polyserial, and polychoric correlations, with and without weights applied. For the 
polyserial and polychoric correlations, the coefficient is estimated using a numerical likelihood 
maximization. 


The weighted (and unweighted) likelihood functions are presented. Then simulation evidence is 
presented to show correctness of the methods, including an examination of the bias and 
consistency. This is done separately for unweighted and weighted correlations. 


Numerical simulations are used to show: 


¢ The bias of the methods as a function of the true correlation coefficient (9) and the 
number of observations (n) in the unweighted and weighted cases; and 


¢ The accuracy [measured with root mean squared error (RMSE) and mean absolute 
deviation (MAD)] of the methods as a function of p and n in the unweighted and 
weighed cases. 


Note that here “bias” is used for the mean difference between true correlation and estimated 
correlation. 


The wCorr Arguments vignette? describes the effects the Maximum Likelihood(ML) and fast 
arguments have on computation and gives examples of calls to wCorr. 


Specification of estimation formulas 


Here we focus on specification of the correlation coefficients between two vectors of random 
variables that are jointly bivariate normal. We call the two vectors X and Y. The i‘” members of 
the vectors are then called x; and y;. 


' The estimation procedure used by the wCorr package for the polyserial is based on the likelihood function in Cox, 
N. R. (1974), "Estimation of the Correlation between a Continuous and a Discrete Variable." Biometrics, 30 (1), pp 
n171-178. The likelihood function for polychoric is from Olsson, U. (1979) "Maximum Likelihood Estimation of 
the Polychoric Correlation Coefficient." Psyhometrika, 44 (4), pp 443-460. The likelihood used for Pearson and 
Spearman is written down in many places. One is the "correlate" function in Stata Corp, Stata Statistical Software: 
Release 8. College Station, TX: Stata Corp LP, 2003. 

2 Sample weights are comparable to pweight in Stata. 

3 The wCorr Arguments vignette can be found at https://cran.r-project.org/web/packages/wCorr/vignettes/wCorr 


Arguments.html 
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Formulas for Pearson correlations with and without weights 


The weighted Pearson correlation is computed using the formula 


dizi lWi(x% — X)O%i — ¥)] 


Tpearson — 


dies Wi (%; — X)?) Dia (Wii — Y)7) 


where w; is the weights, x and y are the weighted mean of the X and Y respectively, and n is the 
number pairs (xj, y;).* 


The unweighted Pearson correlation is calculated by setting all of the weights to one. 
Formulas for Spearman correlations with and without weights 


For the Spearman correlation coefficient the unweighted coefficient is calculated by ranking the 
data and then using those ranks to calculate the Pearson correlation coefficient—so the ranks 
stand in for the X and Y data. Again, similar to the Pearson, for the unweighted case the weights 
are all set to one. 


For the unweighted case the highest rank receives a value of | and the second highest 2, and so 
on down to the n“” value. In addition, when data are ranked, ties must be handled in some way. 
The chosen method is to use the average of all tied ranks. For example, if the second and third 
rank units are tied then both units would receive a rank of 2.5 (the average of 2 and 3). 


For the weighted case there is no commonly accepted weighted Spearman correlation coefficient. 
Stata does not estimate a weighted Spearman and SAS neither documents nor cites their 
methodology in either of the corr or freq procedures. 


The weighted case presents two issues. First, the ranks must be calculated. Second, the 
correlation coefficient must be calculated. 


Calculating the weighted rank for an individual level is done via two terms. For the jth element 
the rank is 


rank; =aj+ b; 


The first term a; is the sum of all weights W less than or equal to this value of the outcome being 
ranked (€;) 


qj =) will < ;) 


4 See the "correlate" function in Stata Corp, Stata Statistical Software: Release 8. College Station, TX: Stata Corp 
LP, 2003. 
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Where 


lifsi<sj 
1G <8) =(oipe ae 


w; is the ith weight and ¢; and ; are the ith and jth value of the vector being ranked, 
respectively. 


The term b; then serves two roles: making the ranks start with one (when there are no ties) and 
making the ranks the average of the ties (when there are ties). 


When there are ties, each unit receives the mean rank for all of the tied units. In the simplest case 
the weights are all one and there are n tied units the vector of tied ranks would be v = 

(a; + 1, a; + 2, .,A; + n) . The mean of this vector (here called rank? to indicate it is a specific 
case of rank when the weights are all one) is then 


n 


1 
rank} = IC + i) 
i=1 
a n(n + >) 
==(na, + y) 
n+1 
=r 
= aj + bi 
Where 
4 1 
bj = 5 


where the superscript one is again used to indicate that this is only for the unweighted case where 
all weights are set to one. 


Performing an analogous computation as above, the total number of units tied for this rank is 
taken to be the sum of all of the weights. And the average of these ranks is then taken to be that 
value divided by the number of sample units tied for this rank. So 


n+1_ 
by = 2 y; 


where w; is the mean weight of all of the tied units. It is easy to see that if w; = 1 for all j then 
bY =p! 
J ie 
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After the X and Y vectors are ranked, they are plugged into the weighted Pearson correlation 
coefficient formula shown earlier. 


Polyserial correlation 


For the polyserial correlation, it is again assumed that there are two continuous variables X and Y 
that have a bivariate normal distribution.” 


(*) ~ N[A,Z] 


where N(A, Z) is a bivariate normal distribution with mean vector A = (uy, Uy) and & is the 
covariance matrix of X and Y. Where X is continuous but Y is discretized to M. For example, if Y 
is partitioned as: (—00, —2, —0.5,1.6, 00), then m; = 2 when —2 < y; < —0.5. 


Figure 1. Density of Y for Cut Points 6 = (—, —2,—0.5,1.6, 0). 


density 


Ey 0.5 1.6 


Y 


Let fy and oy be the mean and standard deviation of the random variable Y. Because a 
transformation of Yto Y + a where a is any real number can be offset by adjusting all cut points 
by a, the mean is irrelevant. A similar argument holds for oy. For convenience, we set, without 
any loss of generality, uy = 0 and oy = 1. 


Cox (1974) observed that the MLE mean and standard deviation of X are simply the average and 
(population) standard deviation of the data and do not depend on any other parameters.° This can 


be taken advantage of by substituting x by its standardized variable Z. 


Combining these simplifications, the probability of any given x;, m; pair can be written as: 


5 For a more complete treatment of the polyserial correlation, see Cox, N. R., "Estimation of the Correlation between 
a Continuous and a Discrete Variable" Biometrics, 50 (March), 171-187, 1974. 

® The population standard deviation is used because it is the MLE for the standard deviation. Notice that, while the 
sample variance is an unbiased estimator of the variance and the population variance is not an unbaised estimator of 
the variance, they are very similar and the variance is also a nuisance parameter, not a parameter of interest when 
finding the correlation. 


American Institutes for Research Weighted and Unweighted Correlation Methods—4 


Omj;+1 
Pr(p = r,O = 0;Z = z;,M =m) = oa { fQ|Z = z,p =r)dy 
r) 


mj 


where Pr(p = r,O = 8;Z = z;,M = m;) is the probability of the event p = r and the cut points 
are given by the vector 0, given the ith data point z; and m;; @(-) is the standard normal; and 

f (|Z, p) is the distribution of Y conditional on Z and p. Because Y and Z are jointly normally 
distributed (by assumption) 


o 
f(V|lZ=z,p=r)~N (us + (Zi —Uz),(1- r?)oy?) ~ N(r.z;,(1 —r?)) 
Zz 


because Z~N(0,1) and Y~N(0,1). 


Because the random variable defined as —= has a standard normal distribution, we can now 
write 
Omps2 — 1° Zi Oma —* Zi 
Pr(p = r,0 = 0;Z = z;,M = m) = $(z; [> (ee) — 9 (Seen 
p l l p( i) 1 _ r2 Vi—-rz 


where ®(-) is the standard normal cumulative density function. 
Using the above probability function, the likelihood function is: 
n 
L(p =r,0 = 9;Z =z,M =m) = | [Prt =r,0=0;Z =z;,M =m,)” 
i=1 


We then have to find the value of p that maximizes the likelihood function. Because the natural 
log function is monotonically increasing, we can maximize 


n 
In(L) = > In[Pr(p = r,O = 0;Z = z;,M = m)] 
i=1 


instead, where w; is the weight of the i*" member of the vectors Z and Y. For the unweighted 
case, all of the weights are set to one. 


In the unweighted case, the value of the nuisance parameter @ is chosen to be 
hy =i 
Gj42 = P-*(n/N) 


where 7 is the number of values to the left of the jth cut point (@;,2 value) and N is the number 
of data points overall. Here two is added to j to make the indexing of 8 agree with Cox (1974) as 
noted before. For the weighted cause n is replaced by the sum of the weights to the left of the jth 
cut point and N is replaced by the total weight of all units 
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A se (a8 I(m; <2) 
j+2 YW; 


Because the numerical optimization is not perfect when the correlation is in a boundary condition 
(p € {—1,1}), a check for perfect correlation is performed before the above optimization by 
simply examining if the values of X and M have weekly agreeing order (or opposite but agreeing 
order) and then the MLE correlation of 1 (or -1) is returned. Where weak agreement means that 
there is no disagreement—allowing for M to have different associated values of X. 


Polychoric correlation 


As with the polyserial correlation, the polychoric correlation is a simple case of two continuous 
variables X and Y that have a bivariate normal distribution. But now both variables are 
discretized. The continuous (latent) variable Y was observed as a discretized variable M and the 
continuous (latent) variable X is discretized into P. 


The random variable P has the same properties as the M defined above. 
Then the probability of any given pair (mj, p;) is 
or pjt1 


O@mj+1 
Pri(p=7,0=6,6' =0;P=9,M=m,)= i f(x yle =r)dydx 
@ 


or Di mj 
where p is the correlation coefficient, 6 is the cut points for M and 6’ is the cut points for P. 


Using the probability, the log-likelihood is defined using the same logic as above. The natural 
logarithm of the likelihood function is then 


n 
> In[Pr(p = r, 6 = 0,6’ = 0’; P = p,, M =™m,)] 
i=1 


is then maximized. This is the weighted log-likelihood function. For the unweighted case set the 
weights to one. 


Again, extreme values (correlations of | or -1) can be detected by testing for weakly ascending 
or descending associations between the discrete variables. 


Simulation results 


It is easy to prove the consistency of the @ for the polyserial correlation and ® and 60’ for the 
polychoric correlation using the non-ML case. Similarly, for p, because it is a MLE that can be 
obtained by taking a derivative and setting it equal to zero, the results are asymptotically 
unbiased and obtain the Cramer-Rao lower bound. 
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This does not speak to the small sample properties of these correlation coefficients. Previous 
work has described their properties by simulation; and that tradition is continued below.’ 


Simulation study of unweighted correlations 


In what follows, when the exact method of selecting a parameter (such as 7) is not noted in the 
above descriptions, it is described as part of each simulation. 


For each iteration (the exact number of times will be stated for each simulation), the following 
procedure is used: 


* select a true Pearson correlation coefficient p; 
¢ select the number of observations n; 


* generate X and Y to be bivariate normally distributed using a pseudo-Random Number 
Generator (RNG); 


* using a pseudo-RNG, select the number of bins for M and P (t and t’) independantly 
from the set {2, 3, 4, 5};° 


* select the bin boundaries for M and P (0 and 6’) by sorting the results of (¢ — 1) and 
(t’ — 1) draws, respectively, from a normal distribution using a pseudo-RNG; 


* confirm that at least 2 levels of each of M and P are occupied (if not, return to previous 
step); and 


¢ calculate and record relevant statistics. 
Bias, and RMSE of the unweighted correlations 
This section shows the bias of the correlations as a function of the true correlation coefficient, p. 


A simulation that consist of fifty computations, was done for is pair (p,n) where p € 
(—0.99, —0.95, —0.90, —0.85,..., 0.95, 0.99), and n € {10,100,1000}. 


The bias and the RSME defined respectively as <A; — p;) and = 10; — pi)? where the 


true correlation coefficient is p; and estimate correlation coefficient r; were evaluated for the 
spearman, polyserial, polychoric and spearman correlations. 


Figure 2 shows the bias as a function of the true correlation p. The simulation study shows that 
only the polyserial shows no bias at any level of n, shown by no clear deviation from 0 at any 
level of p. For the Pearson correlation there is bias when n = 10 that is not present when n = 


7 See, for example, the introduction to Rigdon, E. E. and Ferguson C. E., "The Performance of the Polychoric 
Correlation Coefficient and Selected Fitting Functions in Confirmatory Factor Analysis With Ordinal Data" Journal 
of Marketing Research 28 (4), pp. 491-497. 

8 This means that the simulation uses discrete ordinal variables (M and P) that have 2, 3, 4, or 5 discrete levels. Note 
that the number of levels in M and P are chosen independently so that one could be 2 while the other is 5 (or any 
other possible combination). 
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100 or 1,000. This is a well-known property of the estimator.’ Similarly, the polychoric shows 
bias when n = 10. 


The Spearman correlation shows bias at all of the tested levels of n. The bias is zero when the 
true correlation is 1, 0, or -1; is positive when p is below 0 (negative correlation); and is negative 
when p is above 0 (positive correlation). In this section, the Spearman correlation coefficient is 
compared with the true Pearson correlation coefficient. When this is done, the bias is expected 
because the Spearman correlation is not intended to recover a Pearson type correlation 
coefficient; it is designed to measure a separate quantity. 


Figure 2. Bias Versus p for Unweighted Correlations 
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Figure 3 shows the RMSE as a function of p. All of the correlation coefficients have a uniform 
RMSE when as a function of p—> 0 that decreases when 0 < |p| < 1. All plots also show a 
decrease in RMSE as n increases. This plot shows that there is no appreciable RMSE differences 
as a function of p. In addition, it shows that our attention to the MLE correlation of -1 or 1 at 
edge cases did not make the RMSE much worse in the neighborhood of the edges (|p| > 1). 


° See, for example, Olkin I. and Pratt, J. W. (1958). Unbiased estimation of certain correlation coefficients. Annals 
of Mathematical Statistics, 29(1), 201-211. 
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Figure 3. Root Mean Square Error Versus p for Unweighted Correlations 
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Consistency of the correlations 


Figure 4 shows the RMSE as a function of n. The purpose of this plot is not to show an 
individual value but to show that the estimator is consistent. The plot shows a slope of about — - 


for the Pearson, polychoric, and polyserial correlations. This is consistent with the expected first 
order convergence for each correlation coefficient under the assumptions of this simulation. 
Results for the Spearman also show approximate first order convergence but the slope increases 
slightly as n increases. Again, the Spearman is not estimating the same quantity as the Pearson 
and so is expected to diverge. 


The plot also shows that the RMSE is less than 0.1 for all methods when n > 100. 
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Figure 4. Root Mean Square Error Versus Sample Size for Unweighted Correlations 
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Computing Time 


Figure 5 shows the mean time (in seconds) to compute a single correlation coefficient as a 
function of p and the value of n. The plot shows linearly rising computation times with slopes of 
about one. This is consistent with a linear computation cost. Using Big O notation to denote the 
computation cost, we see that the slopes are all about 1, so we can conclude that the overall 
complexity of these algorithms is O(n). The Spearman has a slightly higher slope, consistent with 
the use of a sort step that is O(n log(n)). 


Figure 5. Computation time 
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Simulation study of weighted correlations 


When complex sampling (other than simple random sampling with replacement) is used, 
unweighted correlations may or may not be consistent. In this section the consistency of the 
weighted coefficients is examined. 


When generating simulated data, decisions about the generating functions have to be made. 
These decisions affect how the results are interpreted. For the weighted case, if these decisions 
lead to something about the higher weight cases being different from the lower weight cases then 
the test will be more informative about the role of weights. Thus, while it is not reasonable to 
always assume that there is a difference between the high and low weight cases, the assumption 
(used in the simulations below) that there is an association between weights and the correlation 
coefficients serves as a more robust test of the methods in this package. 


Results of weighted correlation simulations 


Simulations are carried out in the same fashion as previously described but include a few extra 
steps to accommodate weights. The following changes were made: 


* Weights are assigned according to w; = (x — y)* + 1, and the probability of inclusion in 
the sample was then Pr; = = 


i 
¢ For each unit, a uniformly distributed random number was drawn. When that value was 
less than the probability of inclusion (Pr;), the unit was included. 


Units were generated until n units were in the sample. 


Two simulations were run. The first shows the mean absolute deviation (MAD) 


n 
1 
MAD ==) |n—pil 


i=1 


as a function of p and was run for n = 100 and p € 
(—0.99, —0.95, —0.90, —0.85,...,0.95,0.99), with 100 iterations run for each value of p. 


The following plot shows the MAD for the weighted and unweighted results as a function of p 
when n = 100. This shows that for values of p near zero, under our simulation assumptions (for 
all but the Spearman correlation) the weighted correlation performs better than (has lower MAD 
than) the unweighted correlation for all correlation coefficients. Over the entire range, the 
difference between the two is never such that the unweighted has a lower MAD. Thus, under the 
simulated conditions at least, the weighted correlation has lower or approximately equal MAD 
for every value of the true correlation coefficient (p). 
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Figure 6. Mean Absolute Deviation Versus p (Weighted) 
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The second simulation (shown in Figure 7) used the same values of p and used n € 
{10,100,1000,10000} and shows how RMSE and sample size are related. In particular, it shows 
first order convergence of the weighted Pearson, polyserial, and polychoric correlation 
coefficient. 


For the previous plots the calculated Spearman correlation coefficient was compared to the 
generating Pearson correlation coefficient. For this plot only, the Spearman correlation 
coefficient to the true Spearman correlation coefficient. This is because the Spearman coefficient 
is not attempting to estimate the Pearson correlation. To do this, the simulation is modified 
slightly. A population of data is generated and the true Spearman correlation coefficient then is 
calculated as the population coefficient.'° Then, a sample from the population with varying 
probability as described in the weighted simulation section is used to calculate sample Spearman 
correlation coefficient. Then, the root mean squared difference between the sample and 
population coefficients are calculated as with the Pearson—except that the population Spearman 
correlation coefficient is used in place of the Pearson correlation coefficient (9). 


0 The R stats package cor function is used to calculate the population Spearman correlation coefficient; this 
results in an unweighted coefficient, which is appropriate for the population parameter. 
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Thus, the results in Figure 7 show that, when compared to the true Spearman correlation 
coefficient, the weighted Spearman correlation coefficient is consistent. 


In all cases the RMSE is lower for the weighted than the unweighted. Again, the fact that the 
simulations show that the unweighted correlation coefficient is not consistent does not imply that 
it will always be that way—only that this is possible for these coefficients to not be consistent. 


Figure 7. Root Mean Square Error vs p (Polyserial, Pearson, Polychoric panels) or Population 
Spearman Correlation Coefficient (Spearman Panel) for Weighted and Unweighted Correlations 
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Conclusion 


Overall, the simulations show first-order convergence for each unweighted correlation 
coefficient with an approximately linear computation cost. Further, under our simulation 
assumptions, the weighted correlation performs better than (that is, has lower MAD or RMSE 
than) the unweighted correlation for all correlation coefficients. 


We show the first-order convergence of the weighted Pearson, polyserial, and polychoric 
correlation coefficient. The Spearman is shown to not consistently estimate the population 
Pearson correlation coefficient but is shown to consistently estimate the population Spearman 
correlation coefficient—under the assumptions of our simulation 
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